Aligning Visual Regions and Textual Concepts for Semantic-Grounded Image Representations
In vision-and-language grounding problems, fine-grained representations of the image are considered to be of paramount importance. Most of the current systems incorporate visual features and textual concepts as a sketch of an image. However, plainly inferred representations are usually undesirable in that they are composed of separate components, the relations of which are elusive. In this work, we aim at representing an image with a set of integrated visual regions and corresponding textual concepts, reflecting certain semantics. To this end, we build the Mutual Iterative Attention (MIA) module, which integrates correlated visual features and textual concepts, respectively, by aligning the two modalities. We evaluate the proposed approach on two representative vision-and-language grounding tasks, i.e., image captioning and visual question answering. In both tasks, the semantic-grounded image representations consistently boost the performance of the baseline models under all metrics across the board. The results demonstrate that our approach is effective and generalizes well to a wide range of models for image-related applications.
Reviews: Aligning Visual Regions and Textual Concepts for Semantic-Grounded Image Representations
This paper describes a method for integrating visual and textual features within a self-attention-like architecture. Overall I find this to be a good paper presenting an interesting method, with comprehensive experiments demonstrating the capacity of the method to improve on a wide range of models in image captioning as well as VQA. The analysis is informative, and the supplementary materials add further comprehensiveness. My main complaint is that the paper could be clearer about the current state of the art in these tasks and how the paper's contribution relates to that state of the art. The paper apparently presents a new state-of-the-art on the COCO image captioning dataset, by integrating the proposed method with the Transformer model. It doesn't, however, report what happens if the method is integrated with the prior state-of-the-art model SGAE -- was this tried and shown not to yield improvement?
Reviews: Aligning Visual Regions and Textual Concepts for Semantic-Grounded Image Representations
The paper proposes a new method called Mutual Iterative Attention (MIA) for improving the representations used by common visual-question-answering and image captioning models. MIA works by repeated execution of 'mutual attention', a computation that is similar to the self-attention operation in the Transformer model, but where the lookup ('query') representation is conditioned by information from the other modality. Importantly, the two modalities involved in the MIA operation are not vision and language, they are vision and 'textual concepts' (which they also call 'textual words' and 'visual words' at various points in the paper). These are actual words referring to objects that can be found in the image. The model that predicts textual concepts (the 'visual words' extractor) is trained on the MS-COCO dataset in a separate optimization to the captioning model. Applying MIA to a range of models before attempting VQA or captioning tasks improves the scores, in some cases above the state-of-the-art. It is a strength of this paper that the authors apply their method to a wide range of existing models and observe consistent improvements.
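The 'mutual attention' described in this review can be illustrated with a minimal NumPy sketch. This is one plausible reading of the description, not the authors' implementation: it uses single-head scaled dot-product attention with no learned projections, residual connections, or normalization, and the names `attend`, `mutual_iterative_attention`, and `num_iters` are hypothetical. The only point it makes is that the query for each attention step comes from the other modality, and that the step is repeated.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, source):
    # Scaled dot-product attention: each query row receives a
    # weighted average of the rows of `source`.
    d = query.shape[-1]
    weights = softmax(query @ source.T / np.sqrt(d))
    return weights @ source

def mutual_iterative_attention(regions, concepts, num_iters=2):
    # regions:  (n_regions, d) visual region features
    # concepts: (n_concepts, d) textual concept embeddings
    # Assumption: both modalities already share the same dimension d.
    for _ in range(num_iters):
        # Textual concepts act as queries over the visual regions,
        # giving one integrated visual vector per concept.
        new_regions = attend(concepts, regions)
        # Those visual vectors then act as queries over the concepts.
        new_concepts = attend(new_regions, concepts)
        regions, concepts = new_regions, new_concepts
    return regions, concepts

# Illustrative usage with random features (hypothetical sizes).
rng = np.random.default_rng(0)
example_regions = rng.normal(size=(5, 8))
example_concepts = rng.normal(size=(3, 8))
aligned_regions, aligned_concepts = mutual_iterative_attention(
    example_regions, example_concepts
)
```

In this sketch both outputs end up with one row per textual concept, so each visual vector is paired with a textual vector. Whether the paper pairs them in exactly this way is not stated in the review.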
Aligning Visual Regions and Textual Concepts for Semantic-Grounded Image Representations
Liu, Fenglin, Liu, Yuanxin, Ren, Xuancheng, He, Xiaodong, Sun, Xu